Safe and Efficient Off-Policy Reinforcement Learning
Remi Munos, Tom Stepleton, Anna Harutyunyan, Marc Bellemare
In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) it is efficient as it makes the best use of samples collected from near on-policy behaviour policies. We analyze the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. We believe this is the first return-based off-policy control algorithm converging a.s. to Q* without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins' Q(λ), which was an open problem since 1989. We illustrate the benefits of Retrace(λ) on a standard suite of Atari 2600 games.
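To make the construction concrete, here is a minimal sketch of the Retrace(λ) return for a single trajectory. The tabular, vector-valued interface and the function names are our own, but the trace coefficient c_s = λ min(1, π(a_s|x_s)/μ(a_s|x_s)) is the one the paper defines: truncating the importance ratio at 1 is what gives the low variance, while keeping the ratio at all is what makes the update safe under any behaviour policy.

```python
import numpy as np

def retrace_target(q, pi, mu, states, actions, rewards, gamma=0.99, lam=1.0):
    """Sketch of the Retrace(lambda) return for one trajectory.

    q(s)  -> np.ndarray of Q-value estimates over actions
    pi(s) -> np.ndarray of target-policy probabilities over actions
    mu(s) -> np.ndarray of behaviour-policy probabilities over actions
    states = [x_0 .. x_T], actions = [a_0 .. a_{T-1}], rewards = [r_0 .. r_{T-1}]
    Returns the corrected target for Q(x_0, a_0).
    """
    target = q(states[0])[actions[0]]
    trace = 1.0  # running product of the truncated ratios c_1 ... c_t
    for t in range(len(rewards)):
        if t > 0:
            # Retrace coefficient: c_t = lambda * min(1, pi(a_t|x_t) / mu(a_t|x_t)).
            x_t, a_t = states[t], actions[t]
            trace *= lam * min(1.0, pi(x_t)[a_t] / mu(x_t)[a_t])
        # TD error: delta_t = r_t + gamma * E_{a~pi} Q(x_{t+1}, a) - Q(x_t, a_t).
        exp_q_next = np.dot(pi(states[t + 1]), q(states[t + 1]))
        delta = rewards[t] + gamma * exp_q_next - q(states[t])[actions[t]]
        target += (gamma ** t) * trace * delta
    return target
```

Note that when the behaviour policy matches the target policy and λ = 1, every c_t equals 1 and the target reduces to the on-policy λ-return, which is exactly the "efficient near on-policy" property the abstract claims.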
A Proofs
In this proof, we use the notion of weighted exchangeability as defined in Section 3.2 of [27].
A.2 Proof of Proposition 4.2
The following proof is an adaptation of [14, Proposition 1] to our setting. To get from (32) to (33), we use Assumption 2 and Markov's inequality.
B.1 Further comments on the differences between [14] and COPP
In this subsection, we elaborate on the differences between our work and [14]. As mentioned in the main text, because we integrate out the action in Eq. 7, we are essentially able to use the full dataset when constructing the CP intervals.
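Equations (32) and (33) are not reproduced in this fragment, but the inequality invoked in that step has its standard form; for any nonnegative random variable X:

```latex
% Markov's inequality, as invoked in the step from (32) to (33):
\[
  \Pr\bigl(X \ge a\bigr) \;\le\; \frac{\mathbb{E}[X]}{a}
  \qquad \text{for } X \ge 0,\ a > 0.
\]
```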
AVG-DICE: Stationary Distribution Correction by Regression
Che, Fengdi, Chan, Bryan, Ma, Chen, Mahmood, A. Rupam
Off-policy policy evaluation (OPE), an essential component of reinforcement learning, has long suffered from stationary state-distribution mismatch, which undermines both the stability and the accuracy of OPE estimates. While existing methods correct distribution shifts by estimating density ratios, they often rely on expensive optimization or backward-Bellman-based updates and struggle to outperform simpler baselines. We introduce AVG-DICE, a computationally simple Monte Carlo estimator for the density ratio that averages discounted importance sampling ratios, providing an unbiased and consistent correction. AVG-DICE extends naturally to nonlinear function approximation via regression, which we lightly tune and test on OPE tasks in MuJoCo Gym environments, comparing against state-of-the-art density-ratio estimators run with their reported hyperparameters. In our experiments, AVG-DICE is at least as accurate as state-of-the-art estimators and sometimes offers orders-of-magnitude improvements. However, a sensitivity analysis shows that the best-performing hyperparameters can vary substantially across discount factors, so re-tuning is recommended.
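The abstract does not give the estimator's exact form, so the following is only a hedged sketch of one natural reading of "averaging discounted importance sampling ratios" as a stationary-distribution correction; every name and interface here is our own assumption.

```python
import numpy as np

def avg_dice_weights(trajectory, pi, mu, gamma=0.99):
    """Hedged sketch: credit each visited state with the discounted
    cumulative importance-sampling ratio of the actions taken so far.

    trajectory: list of (state, action) pairs generated by behaviour policy mu.
    pi(s), mu(s) -> np.ndarray of action probabilities.
    Returns one correction weight per visited state.
    """
    weights, ratio = [], 1.0
    for t, (s, a) in enumerate(trajectory):
        # The state at time t was reached via actions a_0 .. a_{t-1},
        # so its weight uses the ratio accumulated *before* this step.
        # (1 - gamma) normalises the discounted visitation distribution.
        weights.append((1.0 - gamma) * (gamma ** t) * ratio)
        ratio *= pi(s)[a] / mu(s)[a]
    return np.asarray(weights)
```

Averaging such weights over many trajectories would give an optimization-free Monte Carlo estimate of the discounted stationary-distribution correction per state, consistent with the abstract's claim of an unbiased and consistent corrector.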
Learning on One Mode: Addressing Multi-Modality in Offline Reinforcement Learning
Wang, Mianchu, Jin, Yue, Montana, Giovanni
Offline reinforcement learning (RL) enables policy learning from static datasets, without active environment interaction, making it ideal for high-stakes applications like autonomous driving and robot manipulation [Levine et al., 2020, Ma et al., 2022, Wang et al., 2024a]. A key challenge in offline RL is managing the discrepancy between the learned policy and the behaviour policy that generated the dataset. Small discrepancies can hinder policy improvement, while large discrepancies push the learned policy into uncharted areas, causing significant extrapolation errors and poor generalisation [Fujimoto et al., 2019, Yang et al., 2023]. Addressing these challenges, existing research has proposed various solutions. Conservative approaches penalise actions that stray into out-of-distribution (OOD) regions [Yu et al., 2020, Kumar et al., 2020], while others regularise the policy by minimising its divergence from the behaviour policy, ensuring better fidelity to the dataset [Fujimoto and Gu, 2021, Wu et al., 2019].
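Of the two families this paragraph names, the divergence-regularisation style has a particularly compact form. Below is a hedged sketch in the spirit of TD3+BC [Fujimoto and Gu, 2021]; the module names and batch interface are assumptions, not any paper's reference code.

```python
import torch

def bc_regularised_policy_loss(critic, policy, states, actions, alpha=2.5):
    """Sketch of a behaviour-regularised actor loss: maximise Q while
    staying close to the dataset actions. `critic` and `policy` are
    torch modules; `states`, `actions` are batches from the offline data.
    """
    pi_actions = policy(states)
    q = critic(states, pi_actions)
    # Normalising by the mean |Q| keeps the two terms on a comparable scale.
    lam = alpha / q.abs().mean().detach()
    # -Q pushes toward high-value actions; the MSE term penalises
    # divergence from the behaviour policy that produced `actions`.
    return -(lam * q).mean() + torch.nn.functional.mse_loss(pi_actions, actions)
```

The single scalar `alpha` then trades off policy improvement against fidelity to the dataset, which is precisely the discrepancy-management dilemma the paragraph describes.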
Streetwise Agents: Empowering Offline RL Policies to Outsmart Exogenous Stochastic Disturbances in RTC
Soni, Aditya, Das, Mayukh, Parayil, Anjaly, Ghosh, Supriyo, Shandilya, Shivam, Cheng, Ching-An, Gopal, Vishak, Khairy, Sami, Mittag, Gabriel, Hosseinkashi, Yasaman, Bansal, Chetan
The difficulty of exploring and training online on real production systems limits the scope of real-time, data- and feedback-driven decision making. The most feasible approach is to adopt offline reinforcement learning from limited trajectory samples. However, after deployment, such policies fail due to exogenous factors that temporarily or permanently disturb the transition distribution of the decision process induced by the offline samples. This results in critical policy failures and generalization errors in sensitive domains like Real-Time Communication (RTC). We address the problem of identifying robust actions in the presence of domain shifts caused by unseen exogenous stochastic factors in the wild. Since it is impossible to learn, within the support of the offline data, generalized policies that are robust to these unseen exogenous disturbances, we propose Streetwise, a novel post-deployment shaping of policies conditioned on a real-time characterization of out-of-distribution sub-spaces. This yields robust actions in bandwidth estimation (BWE) of network bottlenecks in RTC as well as in standard benchmarks. Our extensive experimental results on BWE and other standard offline RL benchmark environments demonstrate significant improvements ($\approx$ 18% in some scenarios) in final returns, with respect to end-user metrics, over state-of-the-art baselines.
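The abstract names the mechanism (post-deployment shaping conditioned on a real-time OOD characterization) but not the shaping rule itself, so the following is purely illustrative; the blending rule and every name here are our assumptions, not the paper's method.

```python
import numpy as np

def shaped_action(policy_action, ood_score, fallback_action, threshold=0.5):
    """Illustrative only: blend the offline policy's action toward a
    conservative fallback in proportion to an out-of-distribution score.
    The actual Streetwise shaping rule is not given in the abstract.
    """
    # ood_score in [0, 1]: 0 = well inside the offline data support,
    # 1 = far outside it. Below the threshold, trust the policy as-is.
    w = np.clip((ood_score - threshold) / (1.0 - threshold), 0.0, 1.0)
    return (1.0 - w) * policy_action + w * fallback_action
```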
Drama: Mamba-Enabled Model-Based Reinforcement Learning Is Sample and Parameter Efficient
Wang, Wenlong, Dusparic, Ivana, Shi, Yucheng, Zhang, Ke, Cahill, Vinny
Model-based reinforcement learning (RL) offers a solution to the data inefficiency that plagues most model-free RL algorithms. However, learning a robust world model often demands complex, deep architectures that are expensive to compute and train. Within the world model, the dynamics model is particularly crucial for accurate predictions, and various dynamics-model architectures have been explored, each with its own challenges. Recurrent neural network (RNN) based world models face issues such as vanishing gradients and difficulty capturing long-term dependencies, while transformers suffer from the well-known limitation of self-attention, whose memory and computational complexity both scale as $O(n^2)$ in the sequence length $n$. To address these challenges, we propose a state space model (SSM) based world model, specifically built on Mamba, that achieves $O(n)$ memory and computational complexity while effectively capturing long-term dependencies and enabling efficient training on longer sequences. We also introduce a novel sampling method to mitigate the suboptimality caused by an inaccurate world model in the early stages of training; combined with the above, it achieves a normalised score comparable to other state-of-the-art model-based RL algorithms using a world model with only 7 million trainable parameters, which is accessible and can be trained on an off-the-shelf laptop. Our code is available at https://github.com/realwenlongwang/drama.git.
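The $O(n)$ claim follows from the linear recurrence at the core of any SSM: the sequence is consumed in a single pass with a fixed-size hidden state. A minimal discretised sketch of that recurrence follows; Mamba's selective, input-dependent parameterisation of A, B, C is deliberately omitted.

```python
import numpy as np

def ssm_scan(A, B, C, inputs):
    """Minimal linear state-space recurrence:
        h_t = A h_{t-1} + B x_t,   y_t = C h_t.
    One step per token, fixed-size state, hence O(n) time and memory
    in the sequence length n.

    A: (d, d) state matrix; B, C: (d,) input/output maps;
    inputs: iterable of scalar tokens x_1 .. x_n.
    """
    h = np.zeros(A.shape[0])
    outputs = []
    for x in inputs:
        h = A @ h + B * x        # state update
        outputs.append(C @ h)    # readout
    return np.asarray(outputs)
```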
A cGAN Ensemble-based Uncertainty-aware Surrogate Model for Offline Model-based Optimization in Industrial Control Problems
This study focuses on two important problems related to applying offline model-based optimization to real-world industrial control problems. The first problem is how to create a reliable probabilistic model that accurately captures the dynamics present in noisy industrial data. The second problem is how to reliably optimize control parameters without actively collecting feedback from industrial systems. Specifically, we introduce a novel cGAN ensemble-based uncertainty-aware surrogate model for reliable offline model-based optimization in industrial control problems. The effectiveness of the proposed method is demonstrated through extensive experiments conducted on two representative cases, namely a discrete control case and a continuous control case. The results of these experiments show that our method outperforms several competitive baselines in the field of offline model-based optimization for industrial control.
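The abstract does not detail the surrogate's interface, so this is a hedged sketch of the general pattern it describes: query an ensemble of conditional generators at the same control setting and read cross-ensemble disagreement as uncertainty. All names here are assumptions.

```python
import numpy as np

def ensemble_prediction(generators, control_params, noise_dim=8, n_samples=64):
    """Sketch: sample each cGAN generator many times at one control
    setting; the pooled mean is the surrogate prediction and the spread
    of per-generator means is the uncertainty signal.

    generators: list of callables g(z, c) -> predicted outcome.
    """
    rng = np.random.default_rng(0)
    samples = np.array([
        [g(rng.standard_normal(noise_dim), control_params)
         for _ in range(n_samples)]
        for g in generators
    ])                                              # (ensemble, n_samples, ...)
    mean = samples.mean(axis=(0, 1))                # surrogate prediction
    uncertainty = samples.mean(axis=1).std(axis=0)  # ensemble disagreement
    return mean, uncertainty
```

A downstream optimizer could then score a candidate control setting by, e.g., mean minus a multiple of the uncertainty, so that parameters are only pushed into regions where the ensemble agrees; this is one way to realise the "reliable optimization without active feedback" goal the abstract states.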